The bank is concerned about loan defaults, as they significantly impact profits. The current loan approval process is manual, effort-intensive, and prone to human error and biases. Although there have been attempts to automate this using heuristics, the focus now is on using data science and machine learning to make the process more efficient and unbiased. It's crucial that any automated system doesn't inherit the human biases. The bank wants to establish a credit scoring model compliant with the Equal Credit Opportunity Act's guidelines, using data from recent approved applicants. This model should be predictive but also interpretable to justify any loan rejections. The objective is to build a classification model to predict potential defaulters and recommend important loan approval features to the bank.
The primary goal is to build a classification model that predicts which clients are likely to default on their loans. Additionally, the model should provide recommendations to the bank about the important features to consider during the loan approval process.
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: Type of job the loan applicant has, such as manager, self-employed, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).
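As a quick worked example of how a debt-to-income ratio like DEBTINC is computed (the numbers here are hypothetical, not from the dataset):

```python
# Hypothetical applicant: $1,500 in monthly debt payments, $4,000 gross monthly income
monthly_debt = 1500
gross_monthly_income = 4000

# Debt-to-income ratio expressed as a percentage, as in the DEBTINC column
debtinc = monthly_debt / gross_monthly_income * 100
print(debtinc)  # 37.5
```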
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn import metrics
from sklearn.metrics import (confusion_matrix, classification_report, precision_recall_curve,
                             recall_score, precision_score, accuracy_score)
# For tuning the model
from sklearn.model_selection import GridSearchCV
# Setting the random seed to 1 for reproducibility of results
import random
random.seed(1)
np.random.seed(1)
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Loading the dataset
data= pd.read_csv('/content/drive/MyDrive/Data science MIT/Capstone project/hmeq.csv')
#check the first 5 rows in the dataset
data.head()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
#check the last 5 rows in the dataset
data.tail()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
#check the number of rows and columns in the dataset
data.shape
(5960, 13)
data.dtypes
BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ        float64
DEROG      float64
DELINQ     float64
CLAGE      float64
NINQ       float64
CLNO       float64
DEBTINC    float64
dtype: object
# Let us see the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
Observation
The target variable BAD is of the int64 datatype and is categorical (binary). Among the input features, 10 are numerical and 2 (REASON and JOB) are strings. Let's check the unique values in each column.
# Checking the count of unique values in each column, nunique can give us information on columns that are categorial or numerical
data.nunique()
BAD           2
LOAN        540
MORTDUE    5053
VALUE      5381
REASON        2
JOB           6
YOJ          99
DEROG        11
DELINQ       14
CLAGE      5314
NINQ         16
CLNO         62
DEBTINC    4693
dtype: int64
Observations:
# Check for duplicated rows
data.duplicated().sum()
0
There are no duplicated rows in the data
#check for null values
data.isnull().sum()
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
#percentage of missing values in the dataframe
round(data.isnull().sum() / data.shape[0]*100,2).sort_values(ascending=True)
BAD         0.00
LOAN        0.00
VALUE       1.88
CLNO        3.72
REASON      4.23
JOB         4.68
CLAGE       5.17
NINQ        8.56
YOJ         8.64
MORTDUE     8.69
DELINQ      9.73
DEROG      11.88
DEBTINC    21.26
dtype: float64
Observation
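These gaps will need handling before modeling. As a minimal sketch (not the project's chosen strategy), one common approach is median imputation for numerical columns and mode imputation for categorical ones, shown here on a small toy frame that reuses two of the dataset's column names with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame reusing two of the dataset's column names (values are made up)
toy = pd.DataFrame({
    "MORTDUE": [25860.0, np.nan, 13500.0, 97800.0],
    "REASON": ["HomeImp", None, "HomeImp", "DebtCon"],
})

# Median for the numerical column, mode for the categorical one
toy["MORTDUE"] = toy["MORTDUE"].fillna(toy["MORTDUE"].median())
toy["REASON"] = toy["REASON"].fillna(toy["REASON"].mode()[0])

print(toy.isnull().sum().sum())  # 0 -> no missing values remain
```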
### Percentage distribution of the target variable "BAD"
data['BAD'].value_counts(1)*100
0    80.050336
1    19.949664
Name: BAD, dtype: float64
Observation
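Given this roughly 80/20 imbalance, the train/test split should preserve the class ratio. A small sketch using the stratify argument of train_test_split on illustrative arrays (not the actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced target: 80 non-defaulters (0) and 20 defaulters (1)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Both splits keep the original 20% default rate
print(y_train.mean(), y_test.mean())
```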
data.describe().round(2).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BAD | 5960.0 | 0.20 | 0.40 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| LOAN | 5960.0 | 18607.97 | 11207.48 | 1100.00 | 11100.00 | 16300.00 | 23300.00 | 89900.00 |
| MORTDUE | 5442.0 | 73760.82 | 44457.61 | 2063.00 | 46276.00 | 65019.00 | 91488.00 | 399550.00 |
| VALUE | 5848.0 | 101776.05 | 57385.78 | 8000.00 | 66075.50 | 89235.50 | 119824.25 | 855909.00 |
| YOJ | 5445.0 | 8.92 | 7.57 | 0.00 | 3.00 | 7.00 | 13.00 | 41.00 |
| DEROG | 5252.0 | 0.25 | 0.85 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 |
| DELINQ | 5380.0 | 0.45 | 1.13 | 0.00 | 0.00 | 0.00 | 0.00 | 15.00 |
| CLAGE | 5652.0 | 179.77 | 85.81 | 0.00 | 115.12 | 173.47 | 231.56 | 1168.23 |
| NINQ | 5450.0 | 1.19 | 1.73 | 0.00 | 0.00 | 1.00 | 2.00 | 17.00 |
| CLNO | 5738.0 | 21.30 | 10.14 | 0.00 | 15.00 | 20.00 | 26.00 | 71.00 |
| DEBTINC | 4693.0 | 33.78 | 8.60 | 0.52 | 29.14 | 34.82 | 39.00 | 203.31 |
Observations from Summary Statistics
# Check the distribution of values in each column
for column in data.columns:
print(f"Column: {column}")
print(data[column].value_counts(normalize=True))
print('-' * 50)
Column: BAD
0 0.800503
1 0.199497
Name: BAD, dtype: float64
--------------------------------------------------
Column: LOAN
15000 0.017617
10000 0.013591
20000 0.012416
25000 0.012248
12000 0.011577
...
47700 0.000168
47100 0.000168
46900 0.000168
46700 0.000168
89900 0.000168
Name: LOAN, Length: 540, dtype: float64
--------------------------------------------------
Column: MORTDUE
42000.0 0.002021
47000.0 0.001838
65000.0 0.001654
50000.0 0.001286
124000.0 0.001286
...
65372.0 0.000184
15346.0 0.000184
58549.0 0.000184
69195.0 0.000184
48811.0 0.000184
Name: MORTDUE, Length: 5053, dtype: float64
--------------------------------------------------
Column: VALUE
60000.0 0.002565
80000.0 0.002394
85000.0 0.002052
65000.0 0.001881
78000.0 0.001710
...
116994.0 0.000171
42682.0 0.000171
72175.0 0.000171
70095.0 0.000171
88934.0 0.000171
Name: VALUE, Length: 5381, dtype: float64
--------------------------------------------------
Column: REASON
DebtCon 0.688157
HomeImp 0.311843
Name: REASON, dtype: float64
--------------------------------------------------
Column: JOB
Other 0.420349
ProfExe 0.224608
Office 0.166872
Mgr 0.135011
Self 0.033973
Sales 0.019187
Name: JOB, dtype: float64
--------------------------------------------------
Column: YOJ
0.00 0.076217
1.00 0.066667
2.00 0.063728
5.00 0.061157
4.00 0.059504
...
29.90 0.000184
12.90 0.000184
13.50 0.000184
0.25 0.000184
8.30 0.000184
Name: YOJ, Length: 99, dtype: float64
--------------------------------------------------
Column: DEROG
0.0 0.861957
1.0 0.082826
2.0 0.030465
3.0 0.011043
4.0 0.004379
5.0 0.002856
6.0 0.002856
7.0 0.001523
8.0 0.001142
9.0 0.000571
10.0 0.000381
Name: DEROG, dtype: float64
--------------------------------------------------
Column: DELINQ
0.0 0.776766
1.0 0.121561
2.0 0.046468
3.0 0.023978
4.0 0.014498
5.0 0.007063
6.0 0.005019
7.0 0.002416
8.0 0.000929
10.0 0.000372
11.0 0.000372
15.0 0.000186
12.0 0.000186
13.0 0.000186
Name: DELINQ, dtype: float64
--------------------------------------------------
Column: CLAGE
102.500000 0.001238
206.966667 0.001238
177.500000 0.001062
123.766667 0.001062
95.366667 0.001062
...
240.856017 0.000177
196.241371 0.000177
71.461705 0.000177
184.880011 0.000177
219.601002 0.000177
Name: CLAGE, Length: 5314, dtype: float64
--------------------------------------------------
Column: NINQ
0.0 0.464404
1.0 0.245688
2.0 0.143119
3.0 0.071927
4.0 0.028624
5.0 0.013761
6.0 0.010275
7.0 0.008073
10.0 0.005138
8.0 0.004037
9.0 0.002018
11.0 0.001835
12.0 0.000367
13.0 0.000367
14.0 0.000183
17.0 0.000183
Name: NINQ, dtype: float64
--------------------------------------------------
Column: CLNO
16.0 0.055071
19.0 0.053503
24.0 0.046009
23.0 0.045138
21.0 0.040955
...
58.0 0.000523
71.0 0.000349
53.0 0.000349
57.0 0.000174
63.0 0.000174
Name: CLNO, Length: 62, dtype: float64
--------------------------------------------------
Column: DEBTINC
37.113614 0.000213
44.382578 0.000213
31.614680 0.000213
41.576701 0.000213
41.395462 0.000213
...
31.613930 0.000213
39.244669 0.000213
40.943866 0.000213
30.444839 0.000213
34.571519 0.000213
Name: DEBTINC, Length: 4693, dtype: float64
--------------------------------------------------
Leading Questions:
# Get list of numerical columns
num_cols = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
print(num_cols)
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
# Get list of categorical columns
cat_cols = data.select_dtypes(include=['object']).columns.tolist()
print(cat_cols)
['REASON', 'JOB']
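Before modeling, REASON and JOB will need a numeric representation for the scikit-learn estimators imported earlier. A minimal sketch using pd.get_dummies on a toy frame (drop_first=True drops one redundant level per variable):

```python
import pandas as pd

# Toy frame with the two categorical columns (values drawn from the data dictionary)
toy = pd.DataFrame({
    "REASON": ["HomeImp", "DebtCon", "HomeImp"],
    "JOB": ["Other", "Office", "Mgr"],
})

encoded = pd.get_dummies(toy, columns=["REASON", "JOB"], drop_first=True)
print(list(encoded.columns))  # ['REASON_HomeImp', 'JOB_Office', 'JOB_Other']
```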
We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
# Defining the hist_box() function
def hist_box(col):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(15, 10))
    sns.set(style='darkgrid')
    # Adding a graph in each part
    sns.boxplot(x=data[col], ax=ax_box, showmeans=True)
    sns.histplot(data[col], kde=True, ax=ax_hist)  # histplot replaces the deprecated distplot
    ax_hist.axvline(data[col].mean(), color='green', linestyle='--')  # Green dashed line marks the mean
    ax_hist.axvline(data[col].median(), color='orange', linestyle='-')  # Orange solid line marks the median
    plt.show()
hist_box('LOAN')
Observation
hist_box('MORTDUE')
Observation
hist_box('VALUE')
Observation
hist_box('YOJ')
Observation
hist_box('DEROG')
Observation
hist_box('DELINQ')
Observation
hist_box('DEBTINC')
Observation
hist_box('NINQ')
Observation
hist_box('CLNO')
Observation
hist_box('CLAGE')
Observation
cat_cols
['REASON', 'JOB']
# Printing the proportion of sub-categories within each categorical column
for i in cat_cols:
print(data[i].value_counts(normalize = True))
print('*' * 40)
DebtCon    0.688157
HomeImp    0.311843
Name: REASON, dtype: float64
****************************************
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: JOB, dtype: float64
****************************************
Observation
Most applicants fall under the 'Other' job category.
# Create subplots: 1 row by 2 columns
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 5))
# The ravel function returns a flattened array
ax = axes.ravel()
for i, col in enumerate(cat_cols):
sns.countplot(x=col, data=data, ax=ax[i], order=data[col].value_counts().index)
ax[i].set_title(f'Distribution of {col}')
ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45)
# Adjust layout
plt.tight_layout()
plt.show()
Clients in the 'Other' job category requested more loans than those in the other categories.
num_cols
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
cols = data[['LOAN','MORTDUE', 'VALUE','YOJ','DEROG','DELINQ','CLAGE','NINQ','CLNO','DEBTINC']].columns.tolist()
plt.figure(figsize=(15,25))
for i, variable in enumerate(cols):
plt.subplot(5,2,i+1)
sns.boxplot(x=data["BAD"],y=data[variable],palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
cat_cols
['REASON', 'JOB']
cols = data[['REASON', 'JOB']].columns.tolist()
plt.figure(figsize=(7, 5))
for i, variable in enumerate(cols):
# Crosstab to get the frequency for each category and status
ct = pd.crosstab(data[variable], data['BAD'])
# Sort the bars based on the count of 'BAD' value (e.g., 1)
# You can change the sorting column if needed
ct = ct.sort_values(by=1, ascending=False)
# Plotting the stacked bar chart
ax = plt.subplot(1, 2, i+1)
ct.plot(kind="bar", stacked=True, ax=ax)
plt.title(variable)
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
Observation
Clients in the 'Other' job category have a higher chance of defaulting on their loan payments. Clients in 'Sales' have the smallest loans and are more likely to repay than clients in the other job categories.
# Let's create a pivot table to summarize the data based on the categorical variables
pivot_table = pd.pivot_table(data, index=['REASON', 'JOB'], columns='BAD')
pivot_table
| REASON | JOB | CLAGE (BAD=0) | CLAGE (BAD=1) | CLNO (BAD=0) | CLNO (BAD=1) | DEBTINC (BAD=0) | DEBTINC (BAD=1) | DELINQ (BAD=0) | DELINQ (BAD=1) | DEROG (BAD=0) | DEROG (BAD=1) | LOAN (BAD=0) | LOAN (BAD=1) | MORTDUE (BAD=0) | MORTDUE (BAD=1) | NINQ (BAD=0) | NINQ (BAD=1) | VALUE (BAD=0) | VALUE (BAD=1) | YOJ (BAD=0) | YOJ (BAD=1) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DebtCon | Mgr | 179.203028 | 146.791449 | 23.420225 | 23.534483 | 35.129543 | 38.591110 | 0.234414 | 1.271930 | 0.151832 | 0.956140 | 21022.149123 | 19923.275862 | 84201.617978 | 80240.646552 | 1.646081 | 2.079646 | 106375.289474 | 115403.442308 | 9.382547 | 8.863158 |
| DebtCon | Office | 179.587257 | 138.171649 | 23.013258 | 20.488636 | 34.196836 | 39.559065 | 0.346392 | 1.558140 | 0.060606 | 0.469880 | 18994.696970 | 19409.782609 | 68879.309804 | 70595.511364 | 1.051485 | 1.954545 | 96780.312500 | 100250.329390 | 6.907129 | 8.965761 |
| DebtCon | Other | 171.000227 | 150.057130 | 19.395883 | 21.034161 | 34.190252 | 37.604921 | 0.199647 | 1.148734 | 0.227599 | 0.563107 | 20922.974036 | 17758.858859 | 61802.455794 | 58077.646624 | 1.322747 | 1.981132 | 86781.949693 | 84135.985669 | 9.707401 | 7.760749 |
| DebtCon | ProfExe | 200.220297 | 174.842451 | 25.326211 | 27.396825 | 32.142007 | 44.358021 | 0.247407 | 1.404762 | 0.104294 | 0.701613 | 19983.426184 | 21089.147287 | 93087.435537 | 103977.927419 | 0.916910 | 1.873016 | 128021.669916 | 136506.228814 | 8.619829 | 9.067442 |
| DebtCon | Sales | 227.072356 | 152.000930 | 23.015873 | 28.058824 | 33.461594 | 51.797315 | 0.206349 | 0.566667 | 0.018182 | 0.875000 | 14976.190476 | 18188.235294 | 87275.259259 | 75957.794118 | 0.174603 | 1.939394 | 121789.619048 | 92922.303030 | 6.986792 | 7.061765 |
| DebtCon | Self | 198.224972 | 147.821700 | 19.707317 | 26.875000 | 33.844055 | 45.729450 | 0.121951 | 1.483871 | 0.024390 | 0.586207 | 36419.512195 | 20381.250000 | 130548.902439 | 76569.266667 | 1.000000 | 1.500000 | 179076.560976 | 108193.370370 | 7.219512 | 5.650000 |
| HomeImp | Mgr | 188.321009 | 134.204602 | 19.830508 | 20.903846 | 34.068242 | 39.192406 | 0.564815 | 1.519231 | 0.070000 | 0.600000 | 15305.084746 | 11855.357143 | 78330.737593 | 71761.224490 | 0.472222 | 1.480769 | 115008.510678 | 86740.430000 | 9.227966 | 8.282143 |
| HomeImp | Office | 188.823871 | 190.618420 | 18.914179 | 18.300000 | 33.414808 | 35.834021 | 0.169355 | 1.766667 | 0.084034 | 1.275862 | 16920.522388 | 11387.878788 | 65897.657895 | 47071.630000 | 0.421260 | 0.740741 | 92287.193433 | 71105.586207 | 11.735537 | 7.054545 |
| HomeImp | Other | 202.190793 | 129.498015 | 19.471058 | 14.222798 | 33.337985 | 37.020812 | 0.285403 | 0.794737 | 0.145652 | 0.656085 | 14072.994129 | 10938.048780 | 56660.921951 | 45106.493506 | 0.943277 | 1.361702 | 78509.387613 | 80022.457068 | 10.393607 | 6.926111 |
| HomeImp | ProfExe | 209.613605 | 173.708747 | 22.602410 | 20.985294 | 31.838539 | 31.946864 | 0.256024 | 1.179104 | 0.009259 | 0.735294 | 17031.325301 | 14605.479452 | 99247.081325 | 83517.714286 | 0.651235 | 1.484375 | 140341.316265 | 122466.220588 | 8.910542 | 9.217361 |
| HomeImp | Sales | 159.964610 | 133.731617 | 22.500000 | 32.000000 | 41.252041 | 25.650741 | 0.000000 | 0.000000 | 1.625000 | 0.500000 | 5800.000000 | 4325.000000 | 69501.000000 | 97638.333333 | 0.500000 | 0.500000 | 68965.250000 | 105413.000000 | 3.750000 | 8.250000 |
| HomeImp | Self | 177.498642 | 172.032554 | 24.000000 | 19.222222 | 32.338956 | 61.376969 | 0.000000 | 2.444444 | 0.159574 | 0.333333 | 27380.851064 | 28852.380952 | 92906.845238 | 104220.950000 | 1.159420 | 2.611111 | 149464.521277 | 147892.380952 | 7.227273 | 7.285714 |
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'LOAN', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 21022.15 | 19923.28 |
| DebtCon | Office | 18994.70 | 19409.78 |
| DebtCon | Other | 20922.97 | 17758.86 |
| DebtCon | ProfExe | 19983.43 | 21089.15 |
| DebtCon | Sales | 14976.19 | 18188.24 |
| DebtCon | Self | 36419.51 | 20381.25 |
| HomeImp | Mgr | 15305.08 | 11855.36 |
| HomeImp | Office | 16920.52 | 11387.88 |
| HomeImp | Other | 14072.99 | 10938.05 |
| HomeImp | ProfExe | 17031.33 | 14605.48 |
| HomeImp | Sales | 5800.00 | 4325.00 |
| HomeImp | Self | 27380.85 | 28852.38 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average LOAN'})
plt.title('Average LOAN by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'MORTDUE', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 84201.62 | 80240.65 |
| DebtCon | Office | 68879.31 | 70595.51 |
| DebtCon | Other | 61802.46 | 58077.65 |
| DebtCon | ProfExe | 93087.44 | 103977.93 |
| DebtCon | Sales | 87275.26 | 75957.79 |
| DebtCon | Self | 130548.90 | 76569.27 |
| HomeImp | Mgr | 78330.74 | 71761.22 |
| HomeImp | Office | 65897.66 | 47071.63 |
| HomeImp | Other | 56660.92 | 45106.49 |
| HomeImp | ProfExe | 99247.08 | 83517.71 |
| HomeImp | Sales | 69501.00 | 97638.33 |
| HomeImp | Self | 92906.85 | 104220.95 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average MORTDUE'})
plt.title('Average MORTDUE by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'VALUE', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 106375.29 | 115403.44 |
| DebtCon | Office | 96780.31 | 100250.33 |
| DebtCon | Other | 86781.95 | 84135.99 |
| DebtCon | ProfExe | 128021.67 | 136506.23 |
| DebtCon | Sales | 121789.62 | 92922.30 |
| DebtCon | Self | 179076.56 | 108193.37 |
| HomeImp | Mgr | 115008.51 | 86740.43 |
| HomeImp | Office | 92287.19 | 71105.59 |
| HomeImp | Other | 78509.39 | 80022.46 |
| HomeImp | ProfExe | 140341.32 | 122466.22 |
| HomeImp | Sales | 68965.25 | 105413.00 |
| HomeImp | Self | 149464.52 | 147892.38 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average VALUE'})
plt.title('Average VALUE by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'YOJ', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 9.38 | 8.86 |
| DebtCon | Office | 6.91 | 8.97 |
| DebtCon | Other | 9.71 | 7.76 |
| DebtCon | ProfExe | 8.62 | 9.07 |
| DebtCon | Sales | 6.99 | 7.06 |
| DebtCon | Self | 7.22 | 5.65 |
| HomeImp | Mgr | 9.23 | 8.28 |
| HomeImp | Office | 11.74 | 7.05 |
| HomeImp | Other | 10.39 | 6.93 |
| HomeImp | ProfExe | 8.91 | 9.22 |
| HomeImp | Sales | 3.75 | 8.25 |
| HomeImp | Self | 7.23 | 7.29 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average YOJ'})
plt.title('Average YOJ by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'DEROG', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 0.15 | 0.96 |
| DebtCon | Office | 0.06 | 0.47 |
| DebtCon | Other | 0.23 | 0.56 |
| DebtCon | ProfExe | 0.10 | 0.70 |
| DebtCon | Sales | 0.02 | 0.88 |
| DebtCon | Self | 0.02 | 0.59 |
| HomeImp | Mgr | 0.07 | 0.60 |
| HomeImp | Office | 0.08 | 1.28 |
| HomeImp | Other | 0.15 | 0.66 |
| HomeImp | ProfExe | 0.01 | 0.74 |
| HomeImp | Sales | 1.62 | 0.50 |
| HomeImp | Self | 0.16 | 0.33 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average DEROG'})
plt.title('Average DEROG by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'DELINQ', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 0.23 | 1.27 |
| DebtCon | Office | 0.35 | 1.56 |
| DebtCon | Other | 0.20 | 1.15 |
| DebtCon | ProfExe | 0.25 | 1.40 |
| DebtCon | Sales | 0.21 | 0.57 |
| DebtCon | Self | 0.12 | 1.48 |
| HomeImp | Mgr | 0.56 | 1.52 |
| HomeImp | Office | 0.17 | 1.77 |
| HomeImp | Other | 0.29 | 0.79 |
| HomeImp | ProfExe | 0.26 | 1.18 |
| HomeImp | Sales | 0.00 | 0.00 |
| HomeImp | Self | 0.00 | 2.44 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average DELINQ'})
plt.title('Average DELINQ by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'CLAGE', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 179.20 | 146.79 |
| DebtCon | Office | 179.59 | 138.17 |
| DebtCon | Other | 171.00 | 150.06 |
| DebtCon | ProfExe | 200.22 | 174.84 |
| DebtCon | Sales | 227.07 | 152.00 |
| DebtCon | Self | 198.22 | 147.82 |
| HomeImp | Mgr | 188.32 | 134.20 |
| HomeImp | Office | 188.82 | 190.62 |
| HomeImp | Other | 202.19 | 129.50 |
| HomeImp | ProfExe | 209.61 | 173.71 |
| HomeImp | Sales | 159.96 | 133.73 |
| HomeImp | Self | 177.50 | 172.03 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average CLAGE'})
plt.title('Average CLAGE by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'NINQ', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 1.65 | 2.08 |
| DebtCon | Office | 1.05 | 1.95 |
| DebtCon | Other | 1.32 | 1.98 |
| DebtCon | ProfExe | 0.92 | 1.87 |
| DebtCon | Sales | 0.17 | 1.94 |
| DebtCon | Self | 1.00 | 1.50 |
| HomeImp | Mgr | 0.47 | 1.48 |
| HomeImp | Office | 0.42 | 0.74 |
| HomeImp | Other | 0.94 | 1.36 |
| HomeImp | ProfExe | 0.65 | 1.48 |
| HomeImp | Sales | 0.50 | 0.50 |
| HomeImp | Self | 1.16 | 2.61 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average NINQ'})
plt.title('Average NINQ by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'CLNO', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 23.42 | 23.53 |
| DebtCon | Office | 23.01 | 20.49 |
| DebtCon | Other | 19.40 | 21.03 |
| DebtCon | ProfExe | 25.33 | 27.40 |
| DebtCon | Sales | 23.02 | 28.06 |
| DebtCon | Self | 19.71 | 26.88 |
| HomeImp | Mgr | 19.83 | 20.90 |
| HomeImp | Office | 18.91 | 18.30 |
| HomeImp | Other | 19.47 | 14.22 |
| HomeImp | ProfExe | 22.60 | 20.99 |
| HomeImp | Sales | 22.50 | 32.00 |
| HomeImp | Self | 24.00 | 19.22 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average CLNO'})
plt.title('Average CLNO by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'DEBTINC', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
| REASON | JOB | BAD = 0 | BAD = 1 |
|---|---|---|---|
| DebtCon | Mgr | 35.13 | 38.59 |
| DebtCon | Office | 34.20 | 39.56 |
| DebtCon | Other | 34.19 | 37.60 |
| DebtCon | ProfExe | 32.14 | 44.36 |
| DebtCon | Sales | 33.46 | 51.80 |
| DebtCon | Self | 33.84 | 45.73 |
| HomeImp | Mgr | 34.07 | 39.19 |
| HomeImp | Office | 33.41 | 35.83 |
| HomeImp | Other | 33.34 | 37.02 |
| HomeImp | ProfExe | 31.84 | 31.95 |
| HomeImp | Sales | 41.25 | 25.65 |
| HomeImp | Self | 32.34 | 61.38 |
# Plotting the heatmap
plt.figure(figsize=(12, 8))  # Adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average DEBTINC'})
plt.title('Average DEBTINC by REASON, JOB, and Default Status (BAD)')
plt.show()
Observation
General Observations
# Check for correlation among the numerical variables
data.corr(numeric_only=True)  # numeric_only avoids errors from the string columns REASON and JOB in newer pandas
| | BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BAD | 1.000000 | -0.075099 | -0.048219 | -0.029954 | -0.060238 | 0.276081 | 0.354107 | -0.170499 | 0.174980 | -0.004157 | 0.199835 |
| LOAN | -0.075099 | 1.000000 | 0.228595 | 0.335393 | 0.105728 | -0.001302 | -0.035144 | 0.088540 | 0.044487 | 0.072631 | 0.084735 |
| MORTDUE | -0.048219 | 0.228595 | 1.000000 | 0.875666 | -0.088480 | -0.049516 | -0.001044 | 0.140047 | 0.031392 | 0.323631 | 0.154939 |
| VALUE | -0.029954 | 0.335393 | 0.875666 | 1.000000 | 0.007759 | -0.048689 | -0.014103 | 0.171238 | -0.004398 | 0.269149 | 0.132174 |
| YOJ | -0.060238 | 0.105728 | -0.088480 | 0.007759 | 1.000000 | -0.065879 | 0.038153 | 0.201718 | -0.071627 | 0.024838 | -0.055891 |
| DEROG | 0.276081 | -0.001302 | -0.049516 | -0.048689 | -0.065879 | 1.000000 | 0.211832 | -0.083047 | 0.173934 | 0.061884 | 0.017065 |
| DELINQ | 0.354107 | -0.035144 | -0.001044 | -0.014103 | 0.038153 | 0.211832 | 1.000000 | 0.022488 | 0.067812 | 0.164639 | 0.052364 |
| CLAGE | -0.170499 | 0.088540 | 0.140047 | 0.171238 | 0.201718 | -0.083047 | 0.022488 | 1.000000 | -0.116935 | 0.237987 | -0.046477 |
| NINQ | 0.174980 | 0.044487 | 0.031392 | -0.004398 | -0.071627 | 0.173934 | 0.067812 | -0.116935 | 1.000000 | 0.088389 | 0.141344 |
| CLNO | -0.004157 | 0.072631 | 0.323631 | 0.269149 | 0.024838 | 0.061884 | 0.164639 | 0.237987 | 0.088389 | 1.000000 | 0.185539 |
| DEBTINC | 0.199835 | 0.084735 | 0.154939 | 0.132174 | -0.055891 | 0.017065 | 0.052364 | -0.046477 | 0.141344 | 0.185539 | 1.000000 |
# Finding the correlation between various columns of the dataset
plt.figure(figsize = (10,7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
There is a strong positive correlation (0.88) between VALUE and MORTDUE, which is expected since the current property value and the amount still due on the mortgage move together; this multicollinearity is worth keeping in mind for linear models.
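Such a highly correlated pair carries largely overlapping information, so one option is to drop or combine one of the two features. A small sketch, on synthetic data mimicking this situation, of how strongly correlated pairs can be flagged programmatically:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
value = rng.normal(100_000, 25_000, size=500)
# MORTDUE is made strongly dependent on VALUE, mimicking the high correlation seen above
mortdue = 0.7 * value + rng.normal(0, 8_000, size=500)
df = pd.DataFrame({"VALUE": value, "MORTDUE": mortdue, "YOJ": rng.normal(9, 7, size=500)})

corr = df.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for c in upper.columns for r in upper.index if upper.loc[r, c] > 0.8]
print(high_pairs)
```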
num_cols
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
# outlier detection using boxplot
# selecting the numerical columns of data and adding their names in a list
num_cols = ['LOAN','MORTDUE','VALUE','YOJ','DEROG','DELINQ','CLAGE','NINQ','CLNO','DEBTINC']
plt.figure(figsize=(15, 12))
for i, variable in enumerate(num_cols):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# to find the 25th percentile and 75th percentile for the numerical columns.
Q1 = data[num_cols].quantile(0.25)
Q3 = data[num_cols].quantile(0.75)
IQR = Q3 - Q1  # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5*IQR #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
# Percentage of outliers in each column
((data[num_cols] < lower_whisker) | (data[num_cols] > upper_whisker)).sum()/data.shape[0]*100
LOAN        4.295302
MORTDUE     3.926174
VALUE       5.369128
YOJ         1.526846
DEROG      12.164430
DELINQ     20.151007
CLAGE       0.788591
NINQ        2.969799
CLNO        3.674497
DEBTINC     1.577181
dtype: float64
Observation
Treating outliers
We will cap/clip the minimum and maximum value of these columns to the lower and upper whisker value of the boxplot found using Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively.
Note: A multiplier of 1.5 * IQR is commonly used to cap the values of outliers at the upper and lower whiskers, but any other multiplier (e.g., 0.5, 2, 3) can be chosen; the appropriate value depends on the business problem.
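To illustrate how the multiplier changes what gets flagged, a small sketch comparing 1.5 * IQR and 3 * IQR fences on made-up numbers:

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6, 7, 8, 15])  # one moderately extreme value

def count_outliers(s, k):
    """Count points outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = Q3 - Q1
    return int(((s < Q1 - k * iqr) | (s > Q3 + k * iqr)).sum())

# The wider 3*IQR fence no longer flags the value 15
print(count_outliers(s, 1.5), count_outliers(s, 3.0))  # 1 0
```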
Creating a function to floor and cap/clip outliers in a column
def treat_outliers(df, col):
    """
    Treats outliers in a numerical variable by capping them at the whiskers.
    df: dataframe
    col: str, name of the numerical column
    """
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1  # Inter Quartile Range (75th percentile - 25th percentile)
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR
    # all the values smaller than lower_whisker will be assigned the value of lower_whisker
    # all the values greater than upper_whisker will be assigned the value of upper_whisker
    # the assignment is done using NumPy's clip function
    df[col] = np.clip(df[col], lower_whisker, upper_whisker)
    return df
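A quick sanity check of the clipping logic on a toy column (made-up values, not the HMEQ data) shows how an extreme value gets pulled down to the upper whisker:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'v': [1, 2, 3, 4, 5, 100]})  # 100 is an extreme value
q1, q3 = toy['v'].quantile(0.25), toy['v'].quantile(0.75)
iqr = q3 - q1
clipped = np.clip(toy['v'], q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(clipped.max())  # → 8.5, the extreme 100 is capped at the upper whisker
```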
Treating outliers in the numerical columns
# treating outliers in each numerical column and visualizing the column after treatment
for col in num_cols:
    data = treat_outliers(data, col)
    sns.boxplot(data=data, x=col)
    plt.show()
Observation
The extreme values in each numerical column have been capped at the lower and upper whiskers, so the boxplots no longer show points beyond the whiskers.
data.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 0.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
data.isnull().sum()
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
#percentage of missing values in the dataframe
round(data.isnull().sum()[data.isnull().sum()>0] / len(data)*100,2).sort_values(ascending=True)
VALUE       1.88
CLNO        3.72
REASON      4.23
JOB         4.68
CLAGE       5.17
NINQ        8.56
YOJ         8.64
MORTDUE     8.69
DELINQ      9.73
DEROG      11.88
DEBTINC    21.26
dtype: float64
# Calculate the percentage of missing values for each column
missing_percentage = (data.isnull().sum() / len(data)) * 100
# Sort the values in descending order for better visualization
missing_percentage_sorted = missing_percentage.sort_values(ascending=False)
# Plot the bar graph
plt.figure(figsize=(12, 8))
sns.barplot(x=missing_percentage_sorted.index, y=missing_percentage_sorted.values)
plt.xticks(rotation=45)
plt.ylabel('Percentage (%)')
plt.title('Percentage of Missing Values by Column')
plt.show()
Observation
BAD and LOAN have no missing values; DEBTINC has the most (~21%), followed by DEROG (~12%).
Filling missing values for the categorical variables
# extracting all the information of other variables where REASON is null
data.loc[data['REASON'].isnull()==True]
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 1 | 2000 | 22608.0 | NaN | NaN | NaN | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | 0 | 2000 | 64536.0 | 87400.000 | NaN | Mgr | 2.5 | 0.0 | 0.0 | 147.133333 | 0.0 | 24.0 | NaN |
| 17 | 1 | 2200 | 23030.0 | NaN | NaN | NaN | 19.0 | NaN | NaN | NaN | NaN | NaN | 14.345367 |
| 51 | 0 | 3100 | NaN | 70400.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5435 | 1 | 32000 | 92400.0 | 200447.375 | NaN | Self | 2.5 | 0.0 | 0.0 | 165.333333 | 1.0 | 40.0 | NaN |
| 5468 | 1 | 32900 | NaN | 55657.000 | NaN | NaN | 1.0 | 0.0 | 0.0 | 210.012265 | 5.0 | 15.0 | 14.345367 |
| 5803 | 1 | 41600 | 159306.0 | NaN | NaN | Other | 0.0 | 0.0 | 0.0 | 160.333333 | 2.0 | 25.0 | NaN |
| 5826 | 1 | 41600 | 159306.0 | NaN | NaN | Other | 0.0 | 0.0 | 0.0 | 142.343205 | 2.0 | 25.0 | 14.345367 |
| 5900 | 1 | 41600 | 46126.0 | 83800.000 | NaN | Other | 28.0 | 0.0 | 0.0 | 339.900000 | 1.0 | 23.0 | NaN |
252 rows × 13 columns
#count the number of unique values
data['REASON'].value_counts()
DebtCon    3928
HomeImp    1780
Name: REASON, dtype: int64
DebtCon is the most frequent category and is therefore the mode.
#imputing the missing values with the mode
value_to_fill = 'DebtCon' # Change this to your specific value
data['REASON'].fillna(value_to_fill, inplace=True)
#count the number of unique values
data['JOB'].value_counts()
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: JOB, dtype: int64
Other is the most frequent category and is therefore the mode.
#imputing the missing values with the mode
value_to_fill = 'Other' # Change this to your specific value
data['JOB'].fillna(value_to_fill, inplace=True)
# extracting all the information of other variables where REASON is null
data.loc[data['REASON'].isnull()==True]
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|
# extracting all the information of other variables where JOB is null
data.loc[data['JOB'].isnull()==True]
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|
Observation
The missing values for REASON and JOB have been imputed.
# extracting all the information of other variables where CLNO is null
data.loc[data['CLNO'].isnull()==True]
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1500 | NaN | NaN | DebtCon | Other | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 1 | 2000 | 22608.0 | NaN | DebtCon | Other | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | 1 | 2200 | 23030.0 | NaN | DebtCon | Other | 19.0 | NaN | NaN | NaN | NaN | NaN | 14.345367 |
| 51 | 0 | 3100 | NaN | 70400.0 | DebtCon | Other | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | 1 | 3600 | 61584.0 | 61800.0 | HomeImp | ProfExe | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4680 | 0 | 24600 | NaN | 146804.0 | DebtCon | Other | NaN | NaN | NaN | NaN | NaN | NaN | 17.263535 |
| 4789 | 0 | 25100 | 85337.0 | 104607.0 | HomeImp | Other | 6.0 | NaN | NaN | NaN | NaN | NaN | 27.950475 |
| 4880 | 0 | 25600 | NaN | 147598.0 | DebtCon | Other | NaN | NaN | NaN | NaN | NaN | NaN | 14.461987 |
| 4899 | 0 | 25700 | 85417.0 | 98179.0 | HomeImp | Other | 7.0 | NaN | NaN | NaN | NaN | NaN | 30.829477 |
| 4947 | 0 | 26100 | NaN | 151429.0 | DebtCon | Other | NaN | NaN | NaN | NaN | NaN | NaN | 15.567001 |
222 rows × 13 columns
It looks like wherever CLNO is null, the data points in other columns are also missing. Let's check this.
data.loc[data['CLNO'].isnull()==True,'NINQ']
3 NaN
10 NaN
17 NaN
51 NaN
63 NaN
..
4680 NaN
4789 NaN
4880 NaN
4899 NaN
4947 NaN
Name: NINQ, Length: 222, dtype: float64
data.loc[data['CLNO'].isnull()==True,'NINQ'].value_counts(dropna=False)
NaN 222 Name: NINQ, dtype: int64
data.loc[data['CLNO'].isnull()==True,'CLAGE'].value_counts(dropna=False)
NaN 222 Name: CLAGE, dtype: int64
data.loc[data['CLNO'].isnull()==True,'DELINQ'].value_counts(dropna=False)
NaN 222 Name: DELINQ, dtype: int64
data.loc[data['CLNO'].isnull()==True,'DEROG'].value_counts(dropna=False)
NaN 222 Name: DEROG, dtype: int64
data.loc[data['CLNO'].isnull()==True,'YOJ'].value_counts(dropna=False)
NaN      117
0.00      14
8.00      14
7.00       8
6.00       8
3.00       8
9.00       7
10.00      7
2.00       5
5.00       5
4.00       5
19.00      4
1.00       4
13.00      3
2.90       1
3.50       1
21.00      1
15.00      1
8.50       1
0.75       1
0.30       1
24.00      1
23.00      1
20.00      1
12.00      1
18.00      1
0.80       1
Name: YOJ, dtype: int64
data.loc[data['CLNO'].isnull()==True,'MORTDUE'].value_counts(dropna=False)
NaN 81
31000.0 2
159306.0 2
83830.0 1
60667.0 1
..
77916.0 1
84666.0 1
80490.0 1
82972.0 1
85417.0 1
Name: MORTDUE, Length: 140, dtype: int64
data.loc[data['CLNO'].isnull()==True,'VALUE'].value_counts(dropna=False)
NaN 18
200447.375 2
80037.000 1
76965.000 1
123339.000 1
..
123613.000 1
93589.000 1
125912.000 1
32365.000 1
151429.000 1
Name: VALUE, Length: 204, dtype: int64
Observation
Wherever CLNO is missing, NINQ, CLAGE, DEROG, and DELINQ are also missing for the same 222 rows, so these columns can be treated together.
Missing value treatment for CLNO, NINQ, CLAGE, DEROG and DELINQ columns
# checking the counts of CLNO, NINQ, CLAGE, DEROG and DELINQ
data.groupby(['REASON','JOB'])[['CLNO','NINQ','CLAGE','DEROG','DELINQ']].value_counts()
REASON JOB CLNO NINQ CLAGE DEROG DELINQ
DebtCon Mgr 5.0 0.0 91.431005 0.0 0.0 1
22.0 0.0 102.972079 0.0 0.0 1
189.768060 0.0 0.0 1
188.140417 0.0 0.0 1
187.713486 0.0 0.0 1
..
HomeImp Self 21.0 1.0 178.578482 0.0 0.0 1
173.214284 0.0 0.0 1
169.343724 0.0 0.0 1
0.0 190.229251 0.0 0.0 1
42.5 3.0 255.415514 0.0 0.0 1
Length: 5017, dtype: int64
# checking the average number of CLNO, NINQ, CLAGE, DEROG and DELINQ
data.groupby(['REASON','JOB'])[['CLNO','NINQ','CLAGE','DEROG','DELINQ']].mean()
| CLNO | NINQ | CLAGE | DEROG | DELINQ | ||
|---|---|---|---|---|---|---|
| REASON | JOB | |||||
| DebtCon | Mgr | 23.354811 | 1.488246 | 175.284110 | 0.0 | 0.0 |
| Office | 21.961897 | 1.111290 | 175.058635 | 0.0 | 0.0 | |
| Other | 19.377332 | 1.275547 | 167.132922 | 0.0 | 0.0 | |
| ProfExe | 25.265258 | 1.049043 | 195.232533 | 0.0 | 0.0 | |
| Sales | 24.561856 | 0.697917 | 200.605195 | 0.0 | 0.0 | |
| Self | 22.089744 | 1.205128 | 171.511805 | 0.0 | 0.0 | |
| HomeImp | Mgr | 20.073529 | 0.768750 | 171.767755 | 0.0 | 0.0 |
| Office | 18.377517 | 0.451957 | 189.004531 | 0.0 | 0.0 | |
| Other | 17.311724 | 0.906475 | 181.177541 | 0.0 | 0.0 | |
| ProfExe | 22.068750 | 0.778351 | 191.634412 | 0.0 | 0.0 | |
| Sales | 25.625000 | 0.500000 | 151.220279 | 0.0 | 0.0 | |
| Self | 22.209821 | 1.425287 | 176.661493 | 0.0 | 0.0 |
Observation
We will use the fillna() function and the transform() method of pandas to impute the missing values.
fillna() function - The fillna() function fills NaN values with the provided input value.
Syntax of fillna(): data['column'].fillna(value = x)
transform() function - The transform() function applies a specified function to each group and returns a result aligned with the original DataFrame's index, which makes it convenient for group-based imputation.
Syntax of transform(): data.groupby(keys)['column'].transform(func)
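A minimal sketch of this groupby-transform imputation pattern on a toy frame (made-up data, not the HMEQ columns): each NaN is filled with the mean of its own group, not the global mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'grp': ['a', 'a', 'a', 'b', 'b', 'b'],
    'val': [10.0, 20.0, np.nan, 1.0, 3.0, np.nan],
})
# transform('mean') broadcasts each group's mean back to the original row order
group_mean = toy.groupby('grp')['val'].transform('mean')
toy['val'] = toy['val'].fillna(group_mean)
print(toy['val'].tolist())  # → [10.0, 20.0, 15.0, 1.0, 3.0, 2.0]
```

The NaN in group 'a' becomes 15.0 (the mean of 10 and 20) while the NaN in group 'b' becomes 2.0, which is exactly the behavior relied on below for the REASON/JOB group imputation.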
# imputing missing values in CLNO column
data['CLNO'] = data['CLNO'].fillna(value = data.groupby(['REASON','JOB'])['CLNO'].transform('mean'))
# imputing missing values in NINQ column
data['NINQ'] = data['NINQ'].fillna(value = data.groupby(['REASON','JOB'])['NINQ'].transform('mean'))
# imputing missing values in CLAGE column
data['CLAGE'] = data['CLAGE'].fillna(value = data.groupby(['REASON','JOB'])['CLAGE'].transform('mean'))
# imputing missing values in DEROG column
data['DEROG'] = data['DEROG'].fillna(value = data.groupby(['REASON','JOB'])['DEROG'].transform('mean'))
# imputing missing values in DELINQcolumn
data['DELINQ'] = data['DELINQ'].fillna(value = data.groupby(['REASON','JOB'])['DELINQ'].transform('mean'))
# checking if all the missing values were imputed in CLNO, NINQ, CLAGE, DEROG, and DELINQ
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
| Count | Percentage | |
|---|---|---|
| MORTDUE | 518 | 8.691275 |
| VALUE | 112 | 1.879195 |
| YOJ | 515 | 8.640940 |
| DEBTINC | 1267 | 21.258389 |
#change the datatypes from float to int
data['CLAGE'] = data['CLAGE'].round().astype(int)
data['DEROG'] = data['DEROG'].round().astype(int)
data['DELINQ'] = data['DELINQ'].round().astype(int)
data['NINQ'] = data['NINQ'].round().astype(int)
data['CLNO'] = data['CLNO'].round().astype(int)
data.dtypes
BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ        float64
DEROG        int64
DELINQ       int64
CLAGE        int64
NINQ         int64
CLNO         int64
DEBTINC    float64
dtype: object
# checking if all the missing values were imputed in CLNO, NINQ, CLAGE, DEROG, and DELINQ
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
| Count | Percentage | |
|---|---|---|
| MORTDUE | 518 | 8.691275 |
| VALUE | 112 | 1.879195 |
| YOJ | 515 | 8.640940 |
| DEBTINC | 1267 | 21.258389 |
# extracting all the information of other variables where MORTDUE is null
data.loc[data['MORTDUE'].isnull()==True]
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1500 | NaN | NaN | DebtCon | Other | NaN | 0 | 0 | 167 | 1 | 19 | NaN |
| 9 | 1 | 2000 | NaN | 62250.0 | HomeImp | Sales | 16.0 | 0 | 0 | 116 | 0 | 13 | NaN |
| 24 | 1 | 2400 | NaN | 17180.0 | HomeImp | Other | NaN | 0 | 0 | 15 | 3 | 4 | NaN |
| 40 | 1 | 3000 | NaN | 8800.0 | HomeImp | Other | 2.0 | 0 | 0 | 78 | 0 | 3 | NaN |
| 41 | 1 | 3000 | NaN | 33000.0 | HomeImp | Other | 1.0 | 0 | 0 | 23 | 1 | 2 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5880 | 0 | 41600 | NaN | 84205.0 | HomeImp | Other | NaN | 0 | 0 | 340 | 0 | 7 | 22.639940 |
| 5883 | 0 | 41600 | NaN | 81322.0 | HomeImp | Self | 9.0 | 0 | 0 | 171 | 0 | 22 | 24.709060 |
| 5884 | 0 | 41600 | NaN | 91309.0 | HomeImp | Other | NaN | 0 | 0 | 350 | 0 | 6 | 22.061330 |
| 5930 | 1 | 41600 | NaN | 85000.0 | DebtCon | Other | 1.0 | 0 | 0 | 117 | 5 | 23 | NaN |
| 5931 | 1 | 41600 | NaN | 89609.0 | DebtCon | Other | 1.0 | 0 | 0 | 116 | 5 | 23 | 41.105569 |
518 rows × 13 columns
# Plotting the distribution of "MORTDUE"
sns.histplot(data['MORTDUE'], bins=30, kde=True)
plt.show()
# Checking the summary statistics
print(data['MORTDUE'].describe())
count      5442.000000
mean      71566.093752
std       37203.654400
min        2063.000000
25%       46276.000000
50%       65019.000000
75%       91488.000000
max      159306.000000
Name: MORTDUE, dtype: float64
sns.displot(data=data,x='MORTDUE',kind='kde')
plt.show()
# checking the average number of MORTDUE
data.groupby(['REASON','JOB'])[['MORTDUE']].mean()
| MORTDUE | ||
|---|---|---|
| REASON | JOB | |
| DebtCon | Mgr | 81412.883162 |
| Office | 67915.683200 | |
| Other | 61002.888087 | |
| ProfExe | 89500.977056 | |
| Sales | 81161.852273 | |
| Self | 101030.421053 | |
| HomeImp | Mgr | 73780.208025 |
| Office | 63644.349537 | |
| Other | 53643.771285 | |
| ProfExe | 89176.335821 | |
| Sales | 77174.818182 | |
| Self | 83019.442308 |
# imputing missing values in MORTDUE column
data['MORTDUE'] = data['MORTDUE'].fillna(value = data.groupby(['REASON','JOB'])['MORTDUE'].transform('mean'))
#check the distribution after imputation of missing values
sns.displot(data=data,x='MORTDUE',kind='kde')
plt.show()
Imputation with the group mean didn't noticeably affect the overall distribution of the column.
# checking if all the missing values were imputed in MORTDUE
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
| Count | Percentage | |
|---|---|---|
| VALUE | 112 | 1.879195 |
| YOJ | 515 | 8.640940 |
| DEBTINC | 1267 | 21.258389 |
Missing value treatment for VALUE
# Plotting the distribution of "VALUE"
sns.histplot(data['VALUE'], bins=30, kde=True)
plt.show()
# Checking the summary statistics
print(data['VALUE'].describe())
count      5848.000000
mean      98538.057633
std       45070.800236
min        8000.000000
25%       66075.500000
50%       89235.500000
75%      119824.250000
max      200447.375000
Name: VALUE, dtype: float64
The VALUE column appears approximately normally distributed with a slight right skew, so imputing the missing values with the group-wise mean is a reasonable choice.
# checking the average number of VALUE
data.groupby(['REASON','JOB'])[['VALUE']].mean()
| VALUE | ||
|---|---|---|
| REASON | JOB | |
| DebtCon | Mgr | 108783.151033 |
| Office | 95515.978626 | |
| Other | 85908.692319 | |
| ProfExe | 121511.691065 | |
| Sales | 106791.007812 | |
| Self | 135942.748288 | |
| HomeImp | Mgr | 102486.195000 |
| Office | 90215.395926 | |
| Other | 75576.722381 | |
| ProfExe | 125971.866875 | |
| Sales | 81114.500000 | |
| Self | 137957.218478 |
# imputing missing values in VALUE column
data['VALUE'] = data['VALUE'].fillna(value = data.groupby(['REASON','JOB'])['VALUE'].transform('mean'))
#recheck the distribution after imputation of missing values
sns.displot(data=data,x='VALUE',kind='kde')
plt.show()
# checking if all the missing values were imputed in VALUE
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
| Count | Percentage | |
|---|---|---|
| YOJ | 515 | 8.640940 |
| DEBTINC | 1267 | 21.258389 |
Missing value treatment for YOJ
# Plotting the distribution of "YOJ"
sns.histplot(data['YOJ'], bins=30, kde=True)
plt.show()
# Checking the summary statistics
print(data['YOJ'].describe())
count    5445.000000
mean        8.873159
std         7.430914
min         0.000000
25%         3.000000
50%         7.000000
75%        13.000000
max        28.000000
Name: YOJ, dtype: float64
sns.displot(data=data,x='YOJ',kind='kde')
plt.show()
# checking the average number of YOJ
data.groupby(['REASON','JOB'])[['YOJ']].median()
| YOJ | ||
|---|---|---|
| REASON | JOB | |
| DebtCon | Mgr | 8.0 |
| Office | 5.0 | |
| Other | 7.0 | |
| ProfExe | 8.0 | |
| Sales | 4.0 | |
| Self | 5.0 | |
| HomeImp | Mgr | 8.0 |
| Office | 10.0 | |
| Other | 7.0 | |
| ProfExe | 7.0 | |
| Sales | 4.0 | |
| Self | 7.0 |
# imputing missing values in the YOJ column
data['YOJ'] = data['YOJ'].fillna(value = data.groupby(['REASON','JOB'])['YOJ'].transform('median'))
#recheck the distribution after imputation of missing values
sns.displot(data=data,x='YOJ',kind='kde')
plt.show()
#change the datatype from float to int
data['YOJ'] = data['YOJ'].round().astype(int)
data.dtypes
BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ          int64
DEROG        int64
DELINQ       int64
CLAGE        int64
NINQ         int64
CLNO         int64
DEBTINC    float64
dtype: object
# checking if all the missing values were imputed in VALUE
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
| Count | Percentage | |
|---|---|---|
| DEBTINC | 1267 | 21.258389 |
Missing Values treatment for DEBTINC
# Plotting the distribution of "DEBTINC"
sns.histplot(data['DEBTINC'], bins=30, kde=True)
plt.show()
# Checking the summary statistics
print(data['DEBTINC'].describe())
count    4693.000000
mean       33.681973
std         7.135236
min        14.345367
25%        29.140031
50%        34.818262
75%        39.003141
max        53.797805
Name: DEBTINC, dtype: float64
data['DEBTINC'].value_counts()
14.345367 59
53.797805 35
37.113614 1
41.824914 1
37.823173 1
..
35.512673 1
34.913793 1
41.077118 1
29.330142 1
34.571519 1
Name: DEBTINC, Length: 4601, dtype: int64
sns.displot(data=data,x='DEBTINC',kind='kde')
plt.show()
DEBTINC appears approximately normally distributed, hence the mean can be used for imputation of the missing values.
# checking the average number of DEBTINC
data.groupby(['REASON','JOB'])[['DEBTINC']].mean()
| DEBTINC | ||
|---|---|---|
| REASON | JOB | |
| DebtCon | Mgr | 35.442766 |
| Office | 34.685394 | |
| Other | 33.616029 | |
| ProfExe | 32.581461 | |
| Sales | 35.133052 | |
| Self | 36.278200 | |
| HomeImp | Mgr | 34.639714 |
| Office | 33.537149 | |
| Other | 33.352517 | |
| ProfExe | 31.879843 | |
| Sales | 39.301879 | |
| Self | 32.907360 |
# imputing missing values in the DEBTINC column
data['DEBTINC'] = data['DEBTINC'].fillna(value = data.groupby(['REASON','JOB'])['DEBTINC'].transform('mean'))
sns.displot(data=data,x='DEBTINC',kind='kde')
plt.show()
data.isnull().sum()
BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
All the missing values have been treated; there are no missing values left in the data, so we can proceed to build our prediction model.
What are the most important observations and insights from the data based on the EDA performed?
From the EDA performed, we observed that the most common JOB category is Other; it is possible that many of these applicants were unemployed, and this could explain why the majority of loan defaults came from this category.
# saving the dataset with all the missing values treated
data.to_csv('/content/drive/MyDrive/Data science MIT/Capstone project/hmeq2.csv',index=False)
Creating dummy variables for the categorical variables
#check if there is balance distribution in the target variable
data['BAD'].value_counts(normalize=True)
0    0.800503
1    0.199497
Name: BAD, dtype: float64
There is an imbalanced distribution between class 0 and class 1: ~80% of the applicants/clients did not default on loan repayment, while ~20% defaulted.
#check the categorical column
cat_cols
['REASON', 'JOB']
# Creating a list of columns for which we will create dummy variables
to_get_dummies_for = ['REASON', 'JOB']
# Creating dummy variables
df = pd.get_dummies(data = data, columns = to_get_dummies_for, drop_first = True)
df.head()
| BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | REASON_HomeImp | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.000000 | 39025.000000 | 10 | 0 | 0 | 94 | 1 | 9 | 33.352517 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1300 | 70053.000000 | 68400.000000 | 7 | 0 | 0 | 122 | 0 | 14 | 33.352517 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1500 | 13500.000000 | 16700.000000 | 4 | 0 | 0 | 149 | 1 | 10 | 33.352517 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1500 | 61002.888087 | 85908.692319 | 7 | 0 | 0 | 167 | 1 | 19 | 33.616029 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1700 | 97800.000000 | 112000.000000 | 3 | 0 | 0 | 93 | 0 | 14 | 33.537149 | 1 | 1 | 0 | 0 | 0 | 0 |
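A small sketch of what get_dummies with drop_first=True does on a toy frame (not the full dataset): the first category alphabetically (here DebtCon) becomes the implicit baseline, avoiding a redundant, perfectly collinear column.

```python
import pandas as pd

toy = pd.DataFrame({'REASON': ['DebtCon', 'HomeImp', 'DebtCon']})
dummies = pd.get_dummies(toy, columns=['REASON'], drop_first=True)
print(list(dummies.columns))  # → ['REASON_HomeImp']; DebtCon is the baseline
```

A row where REASON_HomeImp is 0 therefore means the reason was DebtCon, which is why the table above has no REASON_DebtCon column.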
Preparing data for modeling
Any other preprocessing steps (if needed)
Separating the independent variables (X) and the dependent variable (Y)
# Separating independent variables and the target variable
x = df.drop('BAD',axis=1)
y = df['BAD']
Scaling the data
The independent variables in this dataset have different scales. When features have different scales from each other, there is a chance that a higher weightage will be given to features that have a higher magnitude, and they will dominate over other features whose magnitude changes may be smaller but whose percentage changes may be just as significant or even larger. This will impact the performance of our machine learning algorithm, and we do not want our algorithm to be biased towards one feature.
The solution to this issue is Feature Scaling, i.e. scaling the dataset so as to give every transformed variable a comparable scale.
In this problem, we will use the Standard Scaler method, which centers and scales the dataset using the Z-Score.
It standardizes features by subtracting the mean and scaling it to have unit variance.
The standard score of sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples.
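A minimal numpy sketch of the z-score formula above (equivalent to what StandardScaler computes, which uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # toy feature values
u, s = x.mean(), x.std()            # mean and population std (ddof=0)
z = (x - u) / s
print(z.mean(), z.std())            # ≈ 0.0 and 1.0 after standardization
```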
# Scaling the data
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
x_scaled = pd.DataFrame(x_scaled, columns = x.columns)
Splitting the data into 70% train and 30% test set
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1, stratify = y)
# Checking the shape of the Train and Test sets
print('X Train Shape:', x_train.shape);
print('X Test Shape:', x_test.shape);
print('Y Train Shape:', y_train.shape);
print('Y Test Shape:', y_test.shape);
X Train Shape: (4172, 16) X Test Shape: (1788, 16) Y Train Shape: (4172,) Y Test Shape: (1788,)
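The stratify=y argument is what keeps the ~80/20 class ratio identical in both splits, which matters for an imbalanced target like BAD. A toy illustration (made-up 80/20 labels, not the HMEQ data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)      # 80/20 imbalance, like BAD
X = np.arange(100).reshape(-1, 1)      # dummy feature matrix
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
# both splits preserve the 20% positive rate
print(y_tr.mean(), y_te.mean())
```

Without stratify, a random split could leave the small class under-represented in the test set, distorting the evaluation.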
# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize = (8, 5))
    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['0', '1'], yticklabels = ['0', '1'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # Predicting using the independent variables
    pred = model.predict(predictors)
    recall = recall_score(target, pred, average = 'macro')  # To compute recall
    precision = precision_score(target, pred, average = 'macro')  # To compute precision
    acc = accuracy_score(target, pred)  # To compute accuracy score
    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Precision": precision,
            "Recall": recall,
            "Accuracy": acc,
        },
        index = [0],
    )
    return df_perf
# Fitting the logistic regression model
lg = LogisticRegression()
lg.fit(x_train,y_train)
LogisticRegression()
Checking the model performance
# Checking the performance on the training data
y_pred_train = lg.predict(x_train)
metrics_score(y_train, y_pred_train)
precision recall f1-score support
0 0.80 1.00 0.89 3340
1 0.00 0.00 0.00 832
accuracy 0.80 4172
macro avg 0.40 0.50 0.44 4172
weighted avg 0.64 0.80 0.71 4172
# Checking the performance on the test dataset
y_pred_test = lg.predict(x_test)
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.80 1.00 0.89 1431
1 0.00 0.00 0.00 357
accuracy 0.80 1788
macro avg 0.40 0.50 0.44 1788
weighted avg 0.64 0.80 0.71 1788
lg_test = model_performance_classification(lg,x_test,y_test)
lg_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.400168 | 0.5 | 0.800336 |
Observation
The default logistic regression model predicts only class 0: precision and recall for class 1 (defaulters) are 0.00 on both the train and test sets, a consequence of the class imbalance.
# Fitting the logistic regression model with balanced class weights
lg_2 = LogisticRegression(class_weight='balanced')
lg_2.fit(x_train,y_train)
LogisticRegression(class_weight='balanced')
# Checking the performance on the training data
y_pred_train = lg_2.predict(x_train)
metrics_score(y_train, y_pred_train)
precision recall f1-score support
0 0.88 0.60 0.72 3340
1 0.29 0.66 0.41 832
accuracy 0.62 4172
macro avg 0.59 0.63 0.56 4172
weighted avg 0.76 0.62 0.65 4172
# Checking the performance on the test dataset
y_pred_test = lg_2.predict(x_test)
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.88 0.58 0.70 1431
1 0.29 0.68 0.41 357
accuracy 0.60 1788
macro avg 0.58 0.63 0.55 1788
weighted avg 0.76 0.60 0.64 1788
lg_test_2 = model_performance_classification(lg_2,x_test,y_test)
lg_test_2
Observation
Balancing the class weights greatly improved recall for class 1 (from 0.00 to about 0.68 on the test set), but precision for class 1 remains low (~0.29) and overall accuracy dropped from 0.80 to 0.60.
# Finding the correlation between various columns of the dataset
plt.figure(figsize = (15,7))
sns.heatmap(df.corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
There is a strong positive correlation between VALUE and MORTDUE.
# Define the hyperparameters and their possible values
# Note: not every penalty/solver combination is valid (e.g., 'l1' is not supported
# by 'newton-cg'); GridSearchCV records such failed fits as NaN and skips them
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
    'class_weight': [None, 'balanced']
}
# Create a logistic regression model
lg_3 = LogisticRegression()
# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(lg_3, param_grid, cv=5, scoring='accuracy') # you can adjust the scoring metric
grid_search.fit(x_train, y_train)
# Print the best parameters
print(grid_search.best_params_)
{'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'newton-cg'}
lg_3 = LogisticRegression(C =0.01, class_weight= None, penalty= 'l2', solver= 'newton-cg')
lg_3.fit(x_train,y_train)
LogisticRegression(C=0.01, solver='newton-cg')
# Checking the performance on the training data
y_pred_train = lg_3.predict(x_train)
metrics_score(y_train, y_pred_train)
precision recall f1-score support
0 0.81 0.99 0.89 3340
1 0.72 0.05 0.10 832
accuracy 0.81 4172
macro avg 0.76 0.52 0.49 4172
weighted avg 0.79 0.81 0.73 4172
# Checking the performance on the test dataset
y_pred_test = lg_3.predict(x_test)
metrics_score(y_test, y_pred_test)
precision recall f1-score support
0 0.81 1.00 0.89 1431
1 0.68 0.04 0.08 357
accuracy 0.80 1788
macro avg 0.74 0.52 0.48 1788
weighted avg 0.78 0.80 0.73 1788
lg_test_3 = model_performance_classification(lg_3,x_test,y_test)
lg_test_3
With the tuned model, precision for class 1 improved on both the train and test sets, but recall and F1-score for class 1 dropped sharply.
# Printing the coefficients of logistic regression
cols = x.columns
coef_lg = lg_2.coef_
pd.DataFrame(coef_lg,columns = cols).T.sort_values(by = 0, ascending = False)
| 0 | |
|---|---|
| DEBTINC | 0.043204 |
| NINQ | 0.014584 |
| CLNO | 0.005580 |
| REASON_HomeImp | 0.001890 |
| JOB_Other | 0.000783 |
| JOB_Self | 0.000695 |
| JOB_Sales | 0.000503 |
| VALUE | 0.000003 |
| DEROG | 0.000000 |
| DELINQ | 0.000000 |
| MORTDUE | -0.000007 |
| LOAN | -0.000019 |
| JOB_ProfExe | -0.000775 |
| JOB_Office | -0.001471 |
| CLAGE | -0.005688 |
| YOJ | -0.014565 |
This table displays the coefficients of different features from the logistic regression model. Let's interpret these coefficients:
Positive coefficients: DEBTINC, NINQ, CLNO, REASON_HomeImp, JOB_Other, JOB_Self, JOB_Sales, and VALUE increase the log-odds of default; the debt-to-income ratio (DEBTINC) has the largest positive effect.
Zero coefficients: DEROG and DELINQ have coefficients of (approximately) zero, suggesting they contribute little in this model.
Negative coefficients: MORTDUE, LOAN, JOB_ProfExe, JOB_Office, CLAGE, and YOJ decrease the log-odds of default; longer credit age (CLAGE) and longer job tenure (YOJ) are associated with lower default risk.
Key takeaways: DEBTINC is the most influential positive predictor, while YOJ and CLAGE are the strongest negative predictors. Note that since the model was fit on unscaled features, coefficient magnitudes are not directly comparable across variables with different scales.
odds = np.exp(lg_2.coef_[0]) # Finding the odds
# Adding the odds to a DataFrame and sorting the values
pd.DataFrame(odds, x_train.columns, columns = ['odds']).sort_values(by = 'odds', ascending = False)
| odds | |
|---|---|
| DEBTINC | 1.044151 |
| NINQ | 1.014691 |
| CLNO | 1.005595 |
| REASON_HomeImp | 1.001892 |
| JOB_Other | 1.000784 |
| JOB_Self | 1.000696 |
| JOB_Sales | 1.000503 |
| VALUE | 1.000003 |
| DEROG | 1.000000 |
| DELINQ | 1.000000 |
| MORTDUE | 0.999993 |
| LOAN | 0.999981 |
| JOB_ProfExe | 0.999225 |
| JOB_Office | 0.998530 |
| CLAGE | 0.994328 |
| YOJ | 0.985540 |
Positive coefficients: comparing the coefficients with the odds ratios, we see the same pattern - positive coefficients correspond to odds ratios above 1, indicating an increase in the likelihood of default as the predictor increases.
Negative coefficients: these correspond to odds ratios below 1, indicating a decrease in the likelihood of the positive class as the predictor increases. For example, YOJ has a negative coefficient and an odds ratio of 0.985540, meaning each unit increase in YOJ decreases the odds of default by about 1.5%.
However, features like DEROG and DELINQ have coefficients close to 0 and odds ratios close to 1, indicating they might not be influential in this model.
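The relationship between the two tables is just exponentiation: an odds ratio is exp(coefficient). Taking the DEBTINC coefficient from the table above:

```python
import numpy as np

coef = 0.043204                 # DEBTINC coefficient from the table above
odds_ratio = np.exp(coef)
print(round(odds_ratio, 6))     # → 1.044151, matching the odds table
pct_change = (odds_ratio - 1) * 100
print(round(pct_change, 1))     # → 4.4, i.e. each unit of DEBTINC raises the odds ~4.4%
```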
The Precision-Recall Curve for Logistic Regression
y_scores_lg = lg_2.predict_proba(x_train) # predict_proba gives the probability of each observation belonging to each class
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:, 1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize = (10, 7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label = 'precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc = 'upper left')
plt.ylim([0, 1])
plt.show()
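Rather than reading the crossing point off the plot by eye, a threshold can also be chosen programmatically, e.g. the one maximizing F1. A sketch on a toy label/score array (not the notebook's lg_2 probabilities, which would be y_scores_lg[:, 1]):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# toy true labels and predicted probabilities for the positive class
y_true   = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

prec, rec, thr = precision_recall_curve(y_true, y_scores)
f1 = 2 * prec * rec / (prec + rec + 1e-12)   # small epsilon avoids division by zero
# the last precision/recall pair has no corresponding threshold, so drop it
best = thr[np.argmax(f1[:-1])]
print(best)  # → 0.4, the threshold with the highest F1 on this toy data
```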
The precision and recall curves for class 1 cross at a threshold of about 0.61. Let's find out the performance of the model at this threshold.
optimal_threshold1 = .61
y_pred_train = lg_2.predict_proba(x_train)
metrics_score(y_train, y_pred_train[:, 1] > optimal_threshold1)
precision recall f1-score support
0 0.84 0.85 0.84 3340
1 0.36 0.35 0.36 832
accuracy 0.75 4172
macro avg 0.60 0.60 0.60 4172
weighted avg 0.74 0.75 0.75 4172
optimal_threshold1 = .61
y_pred_test = lg_2.predict_proba(x_test)
metrics_score(y_test, y_pred_test[:, 1] > optimal_threshold1)
precision recall f1-score support
0 0.84 0.85 0.85 1431
1 0.37 0.34 0.35 357
accuracy 0.75 1788
macro avg 0.60 0.60 0.60 1788
weighted avg 0.74 0.75 0.75 1788
Observation: The model performs similarly on the training and test sets, so it is not overfitting, but recall for the defaulter class remains low (about 0.34) even at this threshold.
# Building decision tree model
dt = DecisionTreeClassifier(random_state = 1)
# Fitting decision tree model
dt.fit(x_train, y_train)
DecisionTreeClassifier(random_state=1)
Let's check the performance of the decision tree model.
# Checking performance on the training dataset
y_train_pred_dt = dt.predict(x_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation: The decision tree scores a perfect 1.00 on every training metric, a strong sign that it has overfit the training data.
# Checking performance on the test dataset
y_test_pred_dt = dt.predict(x_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.90 0.91 0.90 1431
1 0.62 0.61 0.61 357
accuracy 0.85 1788
macro avg 0.76 0.76 0.76 1788
weighted avg 0.85 0.85 0.85 1788
dtree_test = model_performance_classification(dt,x_test,y_test)
dtree_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.761261 | 0.758502 | 0.847315 |
Observations: Test performance drops sharply from the perfect training scores (recall for defaulters falls to about 0.61), confirming that the default decision tree overfits.
# Plot the feature importance
importances = dt.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
Observation: DEBTINC stands out as the most important feature for the decision tree as well.
features = list(x.columns)
plt.figure(figsize = (30, 20))
tree.plot_tree(dt, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)
plt.show()
Blue leaves represent defaulters, i.e., y[1], and orange leaves represent non-defaulters, i.e., y[0]. Also, the purer a node (the larger the share of its samples belonging to one class), the darker its color.
Criterion {“gini”, “entropy”}
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
You can learn about more hyperparameters at the link below and try tuning them.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
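The effect of these hyperparameters can be seen on synthetic data: an unconstrained tree memorizes the training set, while `max_depth` and `min_samples_leaf` force it to generalize. A minimal sketch (illustrative data, not the HMEQ dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

unpruned = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=1).fit(X_tr, y_tr)

print(unpruned.score(X_tr, y_tr))      # 1.0 -- memorized the training data
print(pruned.score(X_tr, y_tr) < 1.0)  # True -- pruning prevents memorization
```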
# Check if there is an imbalance in the target variable
data['BAD'].value_counts(normalize = True)
0    0.800503
1    0.199497
Name: BAD, dtype: float64
There is a class imbalance in the target variable: only about 20% of the applicants belong to class 1 (defaulters).
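One way to counter this imbalance, used in the tuned models below, is to weight the classes inversely to their frequency. scikit-learn can compute such weights directly; a minimal sketch with counts mirroring the 80/20 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights = n_samples / (n_classes * count_of_class)
y = np.array([0] * 80 + [1] * 20)   # mirrors the ~80/20 BAD split
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)
```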
# Choose the type of classifier
dtree_estimator = DecisionTreeClassifier(class_weight = {0: 0.20, 1: 0.80}, random_state = 1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 8),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv = 10)
# Fitting the grid search on the train data
gridCV = gridCV.fit(x_train, y_train)
# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_
# Fit the best estimator to the data
dtree_estimator.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=5, min_samples_leaf=5, random_state=1)
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(x_train)
metrics_score(y_train, y_train_pred_dt)
precision recall f1-score support
0 0.94 0.73 0.82 3340
1 0.43 0.81 0.56 832
accuracy 0.74 4172
macro avg 0.68 0.77 0.69 4172
weighted avg 0.84 0.74 0.77 4172
Observation: With class weights and pruning, recall for the defaulter class rises to 0.81 on the training data, at the cost of precision (0.43) and overall accuracy.
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(x_test)
metrics_score(y_test, y_test_pred_dt)
precision recall f1-score support
0 0.93 0.72 0.81 1431
1 0.42 0.79 0.55 357
accuracy 0.74 1788
macro avg 0.67 0.76 0.68 1788
weighted avg 0.83 0.74 0.76 1788
dtree_tuned_test = model_performance_classification(dtree_estimator,x_test,y_test)
dtree_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.674359 | 0.757295 | 0.736018 |
Observation: Training and test scores are now close, so the tuned tree generalizes well, and its test recall for defaulters (about 0.79) is much higher than the default tree's.
importances = dtree_estimator.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
Observation
features = list(x.columns)
plt.figure(figsize = (30, 20))
tree.plot_tree(dtree_estimator, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)
plt.show()
Observation
Random Forest is a bagging algorithm where the base models are decision trees. Bootstrap samples are drawn from the training data, and a separate decision tree is fit on each sample.
The predictions of all the trees are then combined into a final prediction by majority voting (classification) or averaging (regression).
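The voting step can be sketched directly: each tree votes, and the majority class wins (the per-tree predictions below are illustrative):

```python
import numpy as np

tree_preds = np.array([
    [1, 0, 1, 0],   # predictions of tree 1 on four applicants
    [1, 1, 0, 0],   # tree 2
    [1, 0, 0, 1],   # tree 3
])
# Majority vote: class 1 wins when more than half of the trees predict it
majority = (tree_preds.mean(axis=0) > 0.5).astype(int)
print(majority)  # [1 0 0 0]
```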
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier( random_state = 1)
rf_estimator.fit(x_train, y_train)
RandomForestClassifier(random_state=1)
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(x_train)
metrics_score(y_train, y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observation: Like the default decision tree, the random forest fits the training data perfectly, indicating overfitting.
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(x_test)
metrics_score(y_test, y_pred_test_rf)
precision recall f1-score support
0 0.90 0.98 0.94 1431
1 0.87 0.54 0.67 357
accuracy 0.89 1788
macro avg 0.88 0.76 0.80 1788
weighted avg 0.89 0.89 0.88 1788
rf_estimator_test = model_performance_classification(rf_estimator,x_test,y_test)
rf_estimator_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.884894 | 0.761925 | 0.893177 |
Observations: On the test data the random forest reaches 0.89 accuracy with high precision for defaulters (0.87), but its recall for that class is only 0.54, so many defaulters would be missed.
importances = rf_estimator.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
DEBTINC was also the most important feature with the default random forest model.
Tuning the Random Forest classifier
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = {0: 0.2, 1: 0.8}, random_state = 1)
# Grid of parameters to choose from
params_rf = {
"n_estimators": [100, 250, 500],
"min_samples_leaf": np.arange(1, 4, 1),
"max_features": [0.7, 0.9, 'auto'],  # note: 'auto' is deprecated in newer scikit-learn versions; use 'sqrt' instead
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
rf_estimator_tuned.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, max_features='auto',
                       min_samples_leaf=3, random_state=1)
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)
metrics_score(y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 1.00 0.99 0.99 3340
1 0.95 0.99 0.97 832
accuracy 0.99 4172
macro avg 0.98 0.99 0.98 4172
weighted avg 0.99 0.99 0.99 4172
Observation: The tuned random forest still fits the training data almost perfectly (0.99 recall for defaulters), so some overfitting remains.
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)
metrics_score(y_test, y_pred_test_rf_tuned)
precision recall f1-score support
0 0.92 0.96 0.94 1431
1 0.79 0.64 0.71 357
accuracy 0.90 1788
macro avg 0.85 0.80 0.82 1788
weighted avg 0.89 0.90 0.89 1788
rf_estimator_tuned_test = model_performance_classification(rf_estimator_tuned, x_test, y_test)
rf_estimator_tuned_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.854162 | 0.801164 | 0.895414 |
importances = rf_estimator_tuned.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
DEBTINC was also the most important feature with the tuned random forest model.
Observations: Compared with the default random forest, tuning with class weights raises test recall for defaulters from 0.54 to 0.64 while keeping accuracy at about 0.90.
# Installing the xgboost library using the 'pip' command.
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (2.0.0) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.23.5) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.11.3)
# Importing the AdaBoostClassifier and GradientBoostingClassifier [Boosting]
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# Importing the XGBClassifier from the xgboost library
from xgboost import XGBClassifier
# Adaboost Classifier
adaboost_model = AdaBoostClassifier(random_state = 1)
# Fitting the model
adaboost_model.fit(x_train, y_train)
# Model Performance on the test data
adaboost_model_perf_test = model_performance_classification(adaboost_model,x_test,y_test)
adaboost_model_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.81559 | 0.676151 | 0.853468 |
importances = adaboost_model.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
DEBTINC was also the most important feature with the AdaBoost model.
# Gradient Boost Classifier
gbc = GradientBoostingClassifier(random_state = 1)
# Fitting the model
gbc.fit(x_train, y_train)
# Model Performance on the test data
gbc_perf_test = model_performance_classification(gbc, x_test, y_test)
gbc_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.853519 | 0.725517 | 0.87528 |
importances = gbc.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
DEBTINC was also the most important feature with the gradient boosting model.
# XGBoost Classifier
xgb = XGBClassifier(random_state = 1, eval_metric = 'logloss')
# Fitting the model
xgb.fit(x_train,y_train)
# Model Performance on the test data
xgb_perf_test = model_performance_classification(xgb,x_test,y_test)
xgb_perf_test
| | Precision | Recall | Accuracy |
|---|---|---|---|
| 0 | 0.866014 | 0.792752 | 0.897092 |
importances = xgb.feature_importances_
columns = x.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index);
DEBTINC was also the most important feature with the XGBoost model.
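The importance-plotting cell is repeated verbatim for each model; it could be factored into a small helper. A sketch assuming any fitted estimator exposing `feature_importances_` (the helper name is an assumption):

```python
import pandas as pd

def importance_table(estimator, feature_names):
    """Feature importances of a fitted tree-based estimator, sorted descending."""
    return (pd.DataFrame({'Importance': estimator.feature_importances_},
                         index=feature_names)
            .sort_values('Importance', ascending=False))

class _FittedStub:                       # stand-in with fixed importances
    feature_importances_ = [0.2, 0.5, 0.3]

df = importance_table(_FittedStub(), ['LOAN', 'DEBTINC', 'CLAGE'])
print(list(df.index))  # ['DEBTINC', 'CLAGE', 'LOAN']
```

In the notebook this would be called as `importance_table(xgb, x.columns)` for each fitted model.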
models_test_comp_df = pd.concat(
[
lg_test_2.T, lg_test_3.T, dtree_test.T, dtree_tuned_test.T,rf_estimator_test.T,
rf_estimator_tuned_test.T, adaboost_model_perf_test.T,
gbc_perf_test.T, xgb_perf_test.T
],
axis = 1,
)
models_test_comp_df.columns = [
"Logistic Regression",
"Tuned Logistic regression",
"Decision Tree classifier",
"Tuned Decision Tree classifier",
"Random Forest classifier",
"Tuned Random Forest classifier",
"Adaboost classifier",
"Gradientboost classifier",
"XGBoost classifier"
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Logistic Regression | Tuned Logistic regression | Decision Tree classifier | Tuned Decision Tree classifier | Random Forest classifier | Tuned Random Forest classifier | Adaboost classifier | Gradientboost classifier | XGBoost classifier |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.584698 | 0.744080 | 0.761261 | 0.674359 | 0.884894 | 0.854162 | 0.815590 | 0.853519 | 0.866014 |
| Recall | 0.632093 | 0.518563 | 0.758502 | 0.757295 | 0.761925 | 0.801164 | 0.676151 | 0.725517 | 0.792752 |
| Accuracy | 0.601230 | 0.804810 | 0.847315 | 0.736018 | 0.893177 | 0.895414 | 0.853468 | 0.875280 | 0.897092 |
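With recall as the measure of success, the winning model can be read off the comparison frame programmatically; a sketch that rebuilds two columns of the table above:

```python
import pandas as pd

comp = pd.DataFrame(
    {'Tuned Random Forest classifier': [0.854162, 0.801164, 0.895414],
     'XGBoost classifier': [0.866014, 0.792752, 0.897092]},
    index=['Precision', 'Recall', 'Accuracy'])

best_by_recall = comp.loc['Recall'].idxmax()
print(best_by_recall)  # Tuned Random Forest classifier
```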
Observations:
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
3. Proposal for the final solution design:
DEBTINC, CLAGE, LOAN, MORTDUE, and VALUE are the most important features, so they should be the primary features considered when approving loans. Given the importance of correctly identifying bad loans, focusing on recall is crucial; while the Tuned Random Forest model showed the best recall, there is still room for improvement.
More resources should be allocated to determining each client's debt-to-income ratio (DEBTINC) before approving a loan.
DEROG and DELINQ had little impact on loan default in these models, so the amount of resources allocated to checking these features could be reduced.